Show Case


Excel Data Reformat

Overview

This will be a quick summary for those who do not want to know the detailed implementation.
This Program solved a data input process that took up to 5 days to complete. It was highly tedious and required regular intervention.

The program takes in three excel sheets Sheet1, Sheet2, Sheet3. These sheets contain copied data directly from pdf files. It then churns out four sheets into Final_Format, Data 1, Spare Data, Data 2. Final_Format contains general data, a sort of table of contents overview. It is a cleanup of extra lines and unnecessary data in Sheet1. Data 1, Spare Data, and Data 2 contain data that is a more granular view of the data. This data is retreived via regular expression from Sheet2 and Sheet3.

Input


*** The Examples provided below do not represent the true data that was submitted and is only a small example set of the data that the program is required to sift through and gather the desired data. ***

Sheet1

General Data
General Data Sheet

The data that was exported from a pdf file into excel. It is unorganized and contains alot of non-pertinent data. Because of this there is a large function that breaks out the relevant data and deletes extra rows and columns.

Sheet2

The Sheet2 sheet contains the General Table Of Contents (TOC).

General Table Of Contents Data
General TOC Example Data

Once again this is data exported directly from a pdf file into excel. The image provides example of the majority of random orientations the data was presented to my program. My program had to detect the entire section number, as well as get the entire title of the section.

Sheet3

Finally there is the Sheet3 sheet. This sheet contains the paragraph numbers and title of the General "chapters".

Paragraph Table Of Contents Data
Paragraph TOC Example Data

This is the general format for each of the chapters. There is three parts Data 1, Data 2, Data 3. Each of these parts can have 0 to 50 paragraphs each within any given General Section. This quickly adds up and these sheets can have a couple thousand rows of data with weird outlier cases sprinkled throughout.

Output

The Final_Format Workbook is the product of these three sheets.

Final_Format

Sheet one is named Final_Format, it contains the General Data.

General Final Format
General Formatted

From Sheet 1 of the input we gather the General Section, General Title, GD#, and Classification. From Sheet 3 the Paragraph and Descripton are provided

Data 1 Sheet

Paragraph Data 1 Final Format
Paragraph Data 2 Formatted

The second Data 2 sheet provides the breakdown for all data gathered from Data 2 part of the section TOC. Here the General Title and General Section are gathered from the input sheet 2. The Paragraph# and Paragraph Title are from input sheet 3.

Spare Data is formatted the same as Data 1 and 2, it has no data at this time and will be populated manually.

Data 2 Sheet

Paragraph Data 3 Final Format
Paragraph Data 3 Formatted

This is output sheet 4 which is much like sheet 2 only it gathers the Data 3 parts of each section.


Break Down

This program runs out to 685 lines of code so I will not be going into full detail. Instead I will jump into some of the more interesting implimentations used to organize the data.

GeneralReformat()

                            
Sub GeneralReformat()
    Dim GeneralSheet As Worksheet
    Dim TitleSheet As Worksheets
    Dim DataOneSheet as Worksheet
    Dim DataTwoSheet As Worksheet
    Dim FinalSheet As Worksheet
    Dim SpareDataSheet As Worksheet

    Dim NameCheck As Boolean
    NameCheck = CheckWorkbookNames
    If NameCheck <> True Then
        Exit Sub
    End If
    
    Set GeneralSheet = Workbooks("Data Formatter.xlsm").Worksheets("Sheet2")
    Set TitleSheet = Workbooks("Data Formatter.xlsm").Worksheets("Sheet3")
    
    ' Format the SpecTitle sheet
    GeneralSheet.Activate
    GeneralSheet.Copy After:=GeneralSheet
    ActiveSheet.Name = "FormattedGeneral"
    Set GeneralSheet = Workbooks("Data Formatter.xlsm").Worksheets("FormattedGeneral")
    SpecTitleFormat GeneralSheet

    ' Format the TitleSheet
    TitleSheet.Activate
    TitleSheet.Copy After:=TitleSheet
    ActiveSheet.Name = "FormattedTitles"
    ' Create sheets of parts 2 and 3 of the sections
    Sheets.Add(After:=Sheets("FormattedTitles")).Name = "Data 2 Sheet"
    Sheets.Add(After:=Sheets("FormattedTitles")).Name = "Data 1 Sheet"
    Set TitleSheet = Workbooks("Data Formatter.xlsm").Worksheets("FormattedTitles")
    Set DataOneSheet = Workbooks("Data Formatter.xlsm").Worksheets("Data 1 Sheet")
    Set DataTwoSheet = Workbooks("Data Formatter.xlsm").Worksheets("Data 2 Sheet")
    ' Insert the column titles for the parDatat sheets
    PartSheetFormat DataOneSheet
    PartSheetFormat DataTwoSheet
    ' populate the data into the Data sheets
    ParagraphTitleFormat TitleSheet, DataOneSheet, DataTwoSheet, GeneralSheet
    PartSheetFormat_Final DataOneSheet
    PartSheetFormat_Final DataTwoSheet


    
    ' Initial Formating to remove all extra rows and concatinate the values
    Set FinalSheet = Workbooks("Data Formatter.xlsm").Sheets(1)
    FinalSheet.Copy After:=FinalSheet
    Set FinalSheet = Workbooks("Data Formatter.xlsm").Sheets(2)
    FinalSheet.Name = "Final_Format"
    FinalSheet.Activate
    RemoveData FinalSheet
    
    ' With the first sheet input the Data for all GD's
    Dim Bottom As Integer
    FinalColumnLabels FinalSheet
    FinalDataInput FinalSheet
    Set Bottom = FinalSheet.Range("F65536").End(xlUp)
    SpecFormat Bottom
    
    ' Move All sheets to final format workbook and add spare parts sheet
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False
    Workbooks.Add.SaveAs Filename:="Final_Format"
    FinalSheet.Move Before:=Workbooks("Final_Format.xlsx").Sheets(1)
    DataOneSheet.Move After:=Workbooks("Final_Format.xlsx").Sheets(1)
    DataTwoSheet.Move After:=Workbooks("Final_Format.xlsx").Sheets(2)
    Workbooks("Final_Format.xlsx").Sheets(4).Delete
    Workbooks("Final_Format.xlsx").Sheets.Add After:=Workbooks("Final_Format.xlsx").Sheets(2)
    Workbooks("Final_Format.xlsx").Sheets(3).Name = "Spare Parts"
    Set SpareDataSheet = Workbooks("Final_Format.xlsx").Sheets(3)
    PartSheetFormat SpareDataSheet
    PartSheetFormat_Final SpareDataSheet
    
    ' Remove created sheets in Data formatter
    Workbooks("Data Formatter.xlsm").Activate
    Sheets(3).Delete
    Sheets(4).Delete
    Application.DisplayAlerts = True
    Application.ScreenUpdating = True
    Workbooks("Final_Format.xlsx").Activate
    Sheets(1).Activate
End Sub
                            
                        
Main Function

First a quick overview of what the main function does. It starts with a little bit of error checking to be sure all the sheets that need to be there are in fact there. Next some worksheets are defined for future use I create a new sheet for the final product of the FormattedGeneral. This is for easier access to the data I need for populating the other sheets.
The ParagraphTitles are then formatted the same operation is done for the GeneralSheet. At the same time the data for Parts 2 and 3 are pulled and formatted into their respective sheets.
Then the Data sheet is formatted, all excess rows and columns are removed. Populating the GD numbers is part of the General Data.

The final few operations are to pull the sheets from the current formatter workbook into their own workbook. The empty SpareDataSheet is created. The last thing to do is to clean up the original sheet, and turn on the UI elements that were disabled.

CheckWorkbookNames

I feel it is pertinent to mention the use of error checking in this program as it is one of my few implimentations, sadly. I have only recently began to impliment error checking as it was never neccissary.

                            
IsOpen = False
For Each WB In Workbooks
    If Not WB.Name <> "Data Formatter.xlsm" Then
        IsOpen = True
        Exit For
    End If
Next WB
If IsOpen <> True Then
    MyMsg = MsgBox("Check your Workbook Name", vbOKOnly)
    CheckWorkbookNames = False
    Exit Function
End If    
                            
                        
Check Workbook

All the obvious objects that are interacted with are checked for. It makes sure Data Formatter.xlsm is the current workbook.

                            
If Workbooks("Data Formatter.xlsm").Sheets.Count <> 3 Then
    MyMsg = MsgBox("Check Sheets, Sheets Names expected: Sheet1, Sheet2, Sheet3", vbOKOnly)
    CheckWorkbookNames = False
    Exit Function
End If
For i = 1 To Workbooks("Data Formatter.xlsm").Sheets.Count
    With Workbooks("Data Formatter.xlsm")
        If Not .Worksheets(i).Name <> "Sheet1" Then
            Sheet1 = True
        ElseIf Not .Worksheets(i).Name <> "Sheet2" Then
            Sheet2 = True
        ElseIf Not .Worksheets(i).Name <> "Sheet3" Then
            ParaTOC = True
        End If
    End With
Next i
                            
                        
Check Sheet Count

That the Sheets.Count == 3

                            
If Sheet1 <> True Or Sheet2 <> True Or ParaTOC <> True Then
    MyMsg = MsgBox("Check Sheets, Sheets Names expected:" & vbCrLf & vbCrLf & "Sheet1, Sheet2, Sheet3", vbOKOnly)
    CheckWorkbookNames = False
    Exit Function
End If

CheckWorkbookNames = True
                            
                        
Check Sheet Names

That each sheet has one of the exepected names Sheet1, Sheet2, Sheet3.

If all cases are valid the CheckWorkbookNames returns true. Otherwise an error is displayed and CheckWorkbookNames is set to false and the function exits. This will set NameCheck to equal false and cause the Sub to exit.

PopulateDataSheet()

I utilized quite a bit of regex expressions through the data collection process. Most notibly in this function

                            
ParagraphCell.Pattern = "^(\d{1,2}\.\d{1,2})"
SectionNum.Pattern = "SECTION (\d{2}\s\d{2}\s\d{2})$"
SectionNumLong.Pattern = "SECTION (\d{2}\s\d{2}\s\d{2}\.\d{2}\s\d{2})$"
PageNum.Pattern = "Page \d\s"
PartTwo.Pattern = "^[2]\.(\d{1,2})"
PartThree.Pattern = "^[3]\.(\d{1,2})"
ReplacePart.Pattern = "\s+DATA \d(\s*\w*)*$"
With MultipleParaNum
    .Pattern = "((\d+\.{0,1}){2,8}\s+)(\w*\s*\(*\)*-*)+[$|(!2-3]?"
    .Global = True
End With
                            
                        
Regular Expressions

These are used to find and to seperate the numbers for populating into the PartTwo and PartThree sheets. There is also some data that is similar throughout the pages that required for some exception checks. These are the PageNum, ReplacePart, and MultipleParaNum patterns.

In the case of the MultipleParaNum there was an exception where multiple paragraph numbers were put into the same cell. I was unable to programatically seperate these and so the Check Cell column was created. Anytime one of these values was found the cell number is noted and put into this column.

                            
If MultipleParaNum.Test(CurrentCell) <> False Then
    Set ExtactedValues = MultipleParaNum.Execute(CurrentCell)
    If ExtactedValues.Count > 1 Then
        CurrentCell.Interior.ColorIndex = 6
        CurrentCell.Offset(1, 0).EntireRow.Insert
        i = i + 1
        CurrentCell.Offset(1, 0) = CurrentCell.Value2
        Set CurrentCell = Range("A" & i)
        ' Set the row location for problem cell in the Data sheet
        CheckCell = True
        If Left(CurrentCell, Left(CurrentCell, 1)) = 2 Then
            ErrorCount = DataOneSheet.Range("J65536").End(xlUp).Row + 1
        ElseIf Left(CurrentCell, Left(CurrentCell, 1)) = 3 Then
            ErrorCount = DataTwoSheet.Range("J65536").End(xlUp).Row + 1
        End If
    End If
End If
                            
                        
Checking for Multiple Section Numbers in a Cell

This example also shows a major fault that runs as a theme throughout the program. That being the excessive length of the functions. This portion could very easily be broken out into its own function.

After all of these checks The function does what it says it does and populates the section number and paragraph titles into the Data 1 and Data 2 sheets.

StyleDataSheet()

Following the data population the Data Sheets are styled. This means that the columns are given the correct width and font styles are applied.

PopulateGeneral()

There is some similar Regular expressions used here but the special cases where different so I had to accomodate for them

                            
FirstCellFormat.Pattern = "^(\d{1,2}(\s|$))"
NoAZCell.Pattern = "^[^a-zA-Z]+$"
ConcatCell.Pattern = "^(\d{2}\s\d{2}\s\d{2}\s)"
ConcatCellDot.Pattern = "^(\d{2}\s\d{2}\s\d{2}\.\d{2}\s\d{2})"
                            
                        
Regular Expressions

The FirstCellFormat will detect if the cell overflowed into the cell below and concatinate the values.
NoAZCell is used because cells with letters have their own way of being handled.
ConcatCell and ConcatCellDot are basically different varieties of what the General number could look like.

PopulateFinal()

                            
FindSection.Pattern = "^(\d{2}\s){2}(\d{2})"
FindSD.Pattern = "^(GD-\d{2})"
FindParagraph.Pattern = "^(\d{1,2})\."
FindPart.Pattern = "^Data\s"
                            
                        
Regular Expressions

Each cell in the FinalSheet is tested for certain characteristics. Depending on these different patterns the data will be modified.
FindSection is used to find cells that do not contain a section number and removes these cells.
FindSD is used to avoid modifying cells that contain a GD number.
The cells that contain the FindParagraph and FindPart patterns have data that I will want to merge if there is overflow.

FinalDataInput()

The most important peice of this function was the population of the GD number. This was a problem that took quite a bit of time to solve.
I needed to populate the cells from the occurance of one GD to the occurance of the next GD number with the value of the first GD number. The Problem arose when the loop reached the bottom of the sheet.
First of all the bottom is always moving because rows are being deleted, then how to detect that the next cell is back at the top?

                            
Set Lookup = FirstSD
Set NextLookup = Range(Cells(1, 6), Cells(Bottom.Row, 6)).FindNext(Lookup)
Do
    LookupValue = Left(Lookup, 5)
    If Not NextLookup.Address <> FirstSD.Address Then
        Range(Lookup.Address, Bottom.Address).Offset(0, 1) = LookupValue
    Else
        Range(Lookup.Address, NextLookup.Address).Offset(0, 1) = LookupValue
    End If
    If Lookup.Address <> FirstSD.Address Then
        Range(Lookup.Address).EntireRow.Delete
    End If
    Set Lookup = NextLookup
    Set NextLookup = Range(Cells(1, 6), Cells(Bottom.Row, 6)).FindNext(Lookup)
Loop While Lookup.Address <> FirstSD.Address
LookupValue = Left(Lookup, 5)
Range(FirstSD.Address).EntireRow.Delete
                            
                        
Populate GD number

My solution was to use a Do While loop. This way I can detect if the Lookup address is the same as the NextLookup after all operations have been completed.
Other than that it just gets the range between the current GD and the next GD and populates that range of cells.

StyleFinal()

The last function sets the cell styles, font styles, and removes some spaces that are carried over from the origonal pdf file.

Summary

This is one of the largest programs I have written. This is largely do to the many individual caviots that each of the pdf exports brought in.
The end result was a well formatted couple of sheets that the estimator was able to use.